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Abstract 


We present an efficient classical algorithm for training deep Boltzmann machines 
(DBMs) that uses rejection sampling in concert with variational approximations 
to estimate the gradients of the training objective function. Our algorithm is in¬ 
spired by a recent quantum algorithm for training DBMs m. We obtain rigorous 
bounds on the errors in the approximate gradients; in turn, we find that choosing 
the instrumental distribution to minimize the a = 2 divergence with the Gibbs 
state minimizes the asymptotic algorithmic complexity. Our rejection sampling 
approach can yield more accurate gradients than low-order contrastive divergence 
training and the costs incurred in finding increasingly accurate gradients can be 
easily parallelized. Finally our algorithm can train full Boltzmann machines and 
scales more favorably with the number of layers in a DBM than greedy contrastive 
divergence training. 

1 Introduction 

In 2002, Hinton provided the first efficient algorithm for training Boltzmann machines ||2l, a type of 
stochastic recurrent neural network with undirected edges, called contrastive divergence (CD). Due 
to the emergence of CD, Boltzmann machines, specifically restricted Boltzmann machines (RBMs) 
and their deep layered counterparts (DBMs), have become standard tools for solving problems in 
vision and speech recognition Ellll. Despite the remarkable successes that contrastive divergence 
has achieved, there are a number of theoretical and practical drawbacks to the use of the algo¬ 
rithm. A major drawback of contrastive divergence training is that it implicitly adds directionality 
to the edges in the graphical model, which lessens any advantages Boltzmann machines may have 
over feed-forward neural nets. On a more practical level, parallelism cannot be used to increase 
the accuracy of contrastive divergence training which means that only the lowest-order (and least 
accurate) contrastive divergence approximation is used. 

Recent work in quantum computing has revealed a class of training algorithms that use a quantum 
form of rejection sampling to overcome these problems [Tl. The approach hinges on refining quan¬ 
tum states that crudely approximate the joint and conditional probability distributions for the units 
in the Boltzmann machine into ones that closely mimic them. This process is expected to yield accu¬ 
rate and efficient approximations to the distribution if sufficiently strong regularization is used in the 
training process EISl. The gradients of the training objective function (the average log-likelihood 
of the BM producing the training data) are then estimated by either sampling from the resulting 
quantum states or by using techniques like quantum amplitude amplification and estimation. Am¬ 
plitude amplification results in a quadratic speedup with respect to the acceptance probability of 
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Greedy CD-fc training 

IRS training 

Quantum training 

Complexity 

0{NepochNtri,ink£e) 

0{^ epoch ^ train ) 

^ (-^epoch-^t rain^ V^) 

Depth 

O ( A^epoch ke log [iVtrainf J) 

0{Nepoch l0g[WtrainKf J) 

^(-^epoch log[A^train^^]) 


Table 1; Time complexities for training an £-layer DBM with £ edges using greedy contrastive 
divergence, our algorithm and quantum training. The latter two algorithms are not restricted to 
DBMs. Depth is the time needed for a cluster with an unbounded number of computational nodes. 


the rejection step and amplitude estimation quadratically reduces the number of training vectors 
(at the price of quadratically worsening the scaling with £) |[T]. These advantages are summarized 
in ITable T1 

This quantum approach has several signihcant features. First, it is a very natural method for training 
a Boltzmann machine using a quantum computer because it only involves state preparation and 
measurement. Second, it does not explicitly depend on the interaction graph used: it can be used to 
train full Boltzmann machines rather than just DBMs. Third, it is mathematically easy to verify that 
the approximate gradients converge to the true gradients in the appropriate limit. Finally, it does not 
have a well known classical counterpart, unlike many existing quantum machine learning results. 

Unfortunately, quantum computers are currently limited to tens of quantum bits, which means 
that lH] can only be used to train impractically small Boltzmann machines using present-day hard¬ 
ware. Consequently, determining a classical analogue of the quantum approach would embue classi¬ 
cal training methods with the quantum advantages. The challenge is whether the classical analogue 
would still result in efficient training. We present such a classical analogue in this paper. 

We call our approach Instrumental Rejection Sampling (IRS). It retains all of the theoretical advan¬ 
tages of the quantum algorithm, while providing practical advantages for training highly optimized 
deep Boltzmann machines in the presence of regularization. It is worth noting that our approach is 
not specihc to Boltzmann machines: it also applies to more general classes of undirected graphical 
models with latent variables. 

Flere we investigate the quality of the gradients yielded by this method and provide theoretical and 
numerical evidence that training using IRS is not only efficient, but it may also convey practical 
advantages for training certain classes of deep Boltzmann machines. Since these insights stemmed 
from recent progress on quantum machine learning, our work underscores the value of investigating 
quantum paradigms for machine learning even in the absence of large-scale quantum computers. 


2 Instrumental Rejection Sampling 


The main insight behind our result, and stemming from III, is that variational approximations to the 
Gibbs state can be used to make training with rejection sampling much more efficient than it would 
initially appear. This result is conceptually related to work by Murray and Gharhramani Cl in the 
context of MCMC algorithms; whereas, our results hold for rejection sampling and also show how 
to choose the approximation to minimize the error in the sample distribution. We present the general 
theory of IRS here and apply it to training DBMs in |Section3| 

Rejection sampling seeks to draw samples from a distribution P{x)IZ := P{x)/'Y^^P{x) that 
cannot be sampled from directly by sampling instead from an instrumental distribution, Q, and 
rejecting the samples with a probability P{x)/ZkQ{x). Here k is a normalizing constant introduced 
to ensure that the rejection probability is well dehned. In other words, we draw samples from Q{x) 
and reject them with probability 

Praccept(x|(3(a;), K, Z) = , (1) 


until a sample is accepted. This can be implemented by drawing y uniformly from [0,1] and accept¬ 
ing x\fy< Praccept(a:|Q(a;), K, Zq). 


In applications such as training Boltzmann machines, the constants needed to normalize (1) are not 
known or may be prohibitively large for some x. This can be addressed using a form of approximate 
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p(x') 

rejection sampling where we use ka < and Zq ^ Z such that ZqkaQ{x) ^ ^ ™ some set 
of configurations which we call bad. The approximate rejection sampling algorithm then proceeds 
as the precise rejection sampling algorithm except that the sample x will always be accepted if 
X € bad. This means that the samples yielded by approximate rejection sampling are not precisely 
drawn from P/Z. Error estimates for this sampling process are given below. 

Theorem 1. Let Q{x) : x G be an efficiently computable probability distribution that can 
be efficiently sampled from. Assume that P{x) : x G can be efficiently computed and 
P{x) / ZqQ{x) > ka'^ X G bad C where 

{P{x) - kaZqQ{x)) < eZ. 

fcGbad 

There then exists an efficient classical algorithm that samples from a distribution P such that 
'Zx V P{x)P{x) > s/Zfl — e). The probability of accepting a sample is at least 


Our proof of this theorem uses quantum techniques and is given in the appendix. Theorem 1 shows 
that if the conditions normally required of rejection sampling are not met then an approximate sam¬ 
pling algorithm exists that is promised to be close to the distribution if the chosen value of ka is 
sufficiently large. Since the acceptance rate of the sampling algorithm scales inversely with ka, 
elementary choices of Q (such as the uniform distribution) are unlikely to efficiently produce accu¬ 
rate samples from P{x) jZ for polynomially large ka. We call our approach Instrumental Rejection 
Sampling training to emphasize that we use a non-trivial Q to draw the samples. 


In order to minimize the error in TheoremT] we want to choose Q to minimize an appropriate 
divergence between Q and PfZ. The choice of divergence that Q minimizes is by no means unique. 
The result of HI chooses Q to be a product distribution that minimizes ZGL{Q\\P/Z) known as 
the mean-field distribution; however, this choice only provides a polynomial penalty for using an 
exponentially poor approximation to the tail probability. 


We can address this problem by using the method of El to find Q to minimize a divergence that does 
not de-emphasize these tail probabilities. This method finds Q that minimizes DafP/Z\\Q) where 

n t II t + (1 - a)9(a:) 

Da{p\\q) = - 7-, -^-■ (2) 

a(l — a) 

Note that limQ_j.o Da{p\\q) = ZA^{q\\p) and hence the method of El generalizes the mean-field 
approximation. The following lemma shows that choosing Q = Qa =2 to be the Q that minimizes 
D 2 {P/Z\\Q) also minimizes an upper bound on the rejection sampling etTor. 

Lemma 1. Let Q{x) : x G and P{x)jZ : x G be probability distributions then 

{P{x) - KAZqQfx)) ( ZD 2 {P/Z\\Q)\ 

Z \ kaZq ) ’ 

where D 2 {P/Z || <5) = ^ ^ ~ Q{x))'^/Q{x) is the a = 2 divergence and we consider 

the asymptotic regime where D 2 (PlZ || Q) ^ 1. 


Proof. There are two cases, bad = 0 and bad 0. If we have the former case then 
Sa;Gbad (Pi^) ~ ^^aZqQ{x)) = 0 and the result trivially follows. Now let us focus on the case 
where the set is non-empty. Since Q{x) is a probability distribution and ka> ^ 

Pjx) - KaZqQ{x) ^ E.gbad^(d .3, 

Z ~ Z ' ^ 

In order to estimate the probability, it is helpful to think of the problem as a sampling problem for the 
random variable P{x)/Q{x) where x ^ P{x)/Z. The desired probability is then bounded above 
(using Markov’s inequality) by 


P{x)/Z = Pr {P{x)/ZqQ{x)>ka) 

.Gbad X^P(.)/Z 

< Pr {P{x)/ZqQ{x) > ka) 

Xr^P{x) {Z 

^ E(P/g) _y^ P\x) 

~ kaZq ^ ZQ{x)kaZq 


(4) 
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Rewriting the a-divergence as a sum and using the fact that PjZ and Q are normalized to 1 leads 
to 


D^{PIZ\\Q) 


P\x) 

2 4 - Z’^Qix) 


1 

2' 


(5) 


Thus 1(4)1 and |(5)| imply 


P{x)_ ^ ^2D2(P/Z||Q) + 1) 


Z kaZq 

which completes the proof under the assumption that D 2 {P/Z\\Q) ^ 1. 


(6) 

□ 


Lemma 1 shows that choosing Q to minimize D 2 will asymptotically minimize the approximation 


error. In contrast, no corresponding result has been rigorously shown for KL((3||P/Z). 


3 Approximate training of Boltzmann machines using IRS 


We now show how instrumental rejection sampling can be applied to training Boltzmann machines. 
Boltzmann machines model the training data as an Ising model in thermal equilibrium with its 
environment. The goal of training is to adjust the parameters of the Ising model to maximize the 
likelihood that the observed training data would emerge from the thermal distribution in the system. 

A Boltzmann machine consists of spins (bits) which are composed of n„ visible units and rih hidden 
units Eia. The visible units represent the input and output of the model and the hidden units 
are used to generate appropriate correlations between the features in the input and output. The 
correlations can be visualized as edges in a graphical model between pairs of hidden and visible 
units. In general, the underlying graph can be a complete graph; however, in practice a layered 
bipartite graph is usually preferred since it admits efficient training. 

The restricted Boltzmann machine (RBM) consists of two layers of units, where each layer consists 
of either exclusively hidden or exclusively visible units. Edges are restricted to inter-layer correla¬ 
tions. The unnormalized probability of a given configuration of hidden and visible units is 

P{v,h)=e-^^'’'’^\ (7) 

where P/Z for Z = u is the normalized joint probability distribution and 


E{v, h) = — biVi — djhj — WijVihj, (8) 

i 3 hJ 

for binary units Vi G Z 2 and hj G Z 2 . The vectors b and d are called biases and set the probability 
of a unit being zero or one irrespective of the values of the adjacent units. The weights w provide 
energy penalties for two adjacent units taking the value 1. 

Training the model formally involves maximizing the average log-likelihood of the training data 
emerging from the model. This can be thought of as an optimization problem with objective function 

log - log(Z) - (9) 


Oml = — 


At, 


E 


xGXtr^ 


where Xw'^w/2 is a regularization term introduced to combat overfitting. 


Unfortunately, calculation of the objective function is in general intractable as calculation of the 
partition function Z is complete for non-planar Ising models. Nonetheless, the gradient of 
Oml can be estimated without knowing the log-partition function 121 . 


dOuL 

dwi^j 


(Uj Lj) model XWij. 


( 10 ) 


Here the expectation value over the data refers to the average of the corresponding pair of visible 
and hidden units given the constraint that the visible units are clamped to the values of the individual 
training vectors (the hidden units are allowed to vary and take values given by the Gibbs distribution). 
The expectation value over the model corresponds to the average value of the product of the pair of 
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units for the Ising model when the visible units are not constrained to take values in the training set. 
The derivatives of 0ml in the directions of the biases are given in the appendix. 


Despite the innocuous appearance of the gradient in (10) it can still be challenging to compute be 


cause the expectation values require drawing samples from a Gibbs distribution described by (7) 
which is itself an NP-hard task in general. This intractability is sidestepped through the use of 
approximate methods such as contrastive divergence Elia. CD training is efficient, but it has theo¬ 
retical and practical shortcomings ifTOll that we aim to address with Instrumental Rejection Sampling. 


3.1 Approximate Gibbs distributions 

Many methods can be employed to provide an estimate of the Gibbs distribution P/Z. Perhaps 
the simplest is the mean-field approximation (MF), which takes Q to be a factorized probability 
distribution over all hidden and visible units in the graphical model. Among these factorized distri¬ 
butions, Q is taken to be the one that is closest to Pr(?;, h) = jZ, where closest means that 


it minimizes KL{Q \ \e ^/Z).\n particular, for an RBM 


QuFiv, ^) = n 

i=i 




l-hk 


k=l 


where 




( 11 ) 


The optimal mean-field parameters can be approximated by solving (11) using fixed point iteration. 
The MF approximation for generic Boltzmann machines takes a very similar form 15]. Note that 
here Q{v, h) is a product distribution, but multi-modal or structured mean-field approximations can 
be used instead in cases where the MF approximation is expected to break down Q. 


Although the mean-field approximation is expedient to compute, [Lemma T suggests that choosing 
Q to minimize D 2 will asymptotically minimize the algorithm’s complexity. We refer to the product 
distribution that D 2 {P/Z\\Q) as Qa= 2 - The minimization strategy used to find Qmf does not 
work for Qa =2 because D 2 does not contain logarithms and hence more general methods, such as 
fractional belief propagation, are needed. Fractional belief propagation works by choosing Q to 
variationally minimize an upper bound on the log-partition function that corresponds to the choice 
a = 2. The algorithm is explained in detail in ifTTl I^. 

In order to maximize the probability of success, it is useful to have an estimate of the partition 
function Z. The log-partition function can be estimated for any product distribution Q using 

log(^) > log(ZQ) := V Q{x) log {) =-{E)- H[Q{x)], (12) 


Q{x) 

where H \Q{x )] is the Shannon entropy of Q{x) and {E) is the expected energy in the state Q. 
Equation (12) holds with equality if and only if Q{x) = /Z. The slack in the inequality is the 

Kullback-Leibler divergence KLjQ II e which means that if Zq = — (E) — H[Q{x)] then 

the approximation will become more accurate as Q approaches the Gibbs distribution H]. We denote 
this mean-field approximation to the partition function as Zmf ■ Other tractable approximations to 
the log-partition function can also be used EHH. For simplicity, we use ^mf for the majority of 
the subsequent numerical experiments. 


3.2 Training algorithm 


Our training algorithm for Boltzmann machines is given in Algorithm 1 The algorithm assumes 


that the user has (1) a method Q{v, h) that approximates the model distribution and (2) a family 
of distributions Q{h; v) that estimates the data distribution when the visible units are clamped to 
data vector v. We assume in both cases that the resulting distributions are product distributions. 
Note that P{h\v) is expected to be approximately unimodal for trained models fS] and hence it is 
reasonable to expect that it will provide a good approximation to the true probability. We assume 
the that drawing a sample from Q{v, h) or Q{h; v) costs one operation. Arithmetic operations such 
as addition, multiplication and exponentiation are each assumed to cost a single operation. 
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Algorithm 1 Rejection Sampling Training Algorithm IRS(w, b, d, {xtrain})- 


repeat 

Compute distribution Q from w b and d. 

Compute Zq from Q. 
for each Training vector x G Xtrain do 

Compute distribution Q{h\x) from x, w, b and d. 

Compute from Q{h\x). 

repeat 

Attempt to sample from 
with instrumental distribution Q{h\x) and 
until sample is accepted 

repeat 

Attempt to sample from P/Z using (1) with instrumental distribution Q and Zqka- 
until sample is accepted 

end for 


e using 


( 1 ) 


Compute gradients using expectation values of accepted samples and (10) 
Update weights and biases using a gradient step with learning rate r. 
until Converged or maximum epochs reached. 


Theorem 2. The expected number of query and arithmetic operations required to train a 
£-layer connected DBM containing £ e dges using rejection sampling for A^epoch epochs is 
O (Aepoch-Atrain^^A'S’), if the assumptions of Theorem 1 hold with (1 — e) € f2(l) and the mean-field 
approximation is used to estimate the partition function. 


The proof of the theorem is a straightforward exercise in counting the number of expected steps 
in [Algorithm 1 and is given for completeness in the appendix. 


In contrast, the cost of using greedy contrastive divergence training with CD-fc is 
0(iVepochA^trainfc^^) 0. Thus IRS training will have an advantage over contrastive divergence 
for training sufficiently deep networks if ka G 0(k). It is worth noting that MF-CD training jSl 
also offers advantages for training deep networks but can be less accurate than CD training 1(131 . 


A further advantage of IRS over CD training is that the rejection sampling step can be parallelized. 
Conversely, the k sampling steps used in CD-fc cannot be parallelized. Parallelism can boost the 


accuracy of IRS training without increasing the runtime of the algorithm, as discussed in Table 1 


3.3 Accuracy of gradients 


As with contrastive divergence training, the accuracy of the gradients yielded by IRS training is 
controllable. The two parameters that affect the quality of the gradients are ka and Agamp- As both 
of these quantities approach infinity, the error in the gradient goes to zero. We formalize this in the 
following corollary, which is proven in the appendix. 

Corollary 1. The expected Euclidean norm of the difference between the gradient computed by 
Asamp samples using rejection sampling with instrumental distribution Q using ka and Z « Zq 
for a connected binary Boltzmann machine with £ edges computed using P jZ is 



The proof of Corollary 1 follows directly from Lemma 1 standard error bounds on statistical sam- 

— Atrain io . 


pling and norm inequalities. Note that the choice Ag. 


Algorithm 1 is not necessary. 


Figure 1 [confirms the expectations of Corollary l[by showing that the error in the estimated gradient 
shrinks as 0(l/.^Asamp) if ka is sufficiently large. Furthermore, because there are twice as many 
weights in the RBM with = 6 and rih = 4: as there are in the RBM with = 4 and nt = 3, the 
ratio of the two errors should be a factor of v/2- The numerical experiments on these small RBMs 
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Figure 1: Mean difference between the gradient of the weight vector computed by rejection sam¬ 
pling, ViCest^ the gradient computed directly from (10) for 100 randomly generated RBMs as 
a function of the number of samples considered in the rejection sampling algorithm with k = 10. 
Weights and biases are taken to be Af{0, 1). (Left) Uy = 4, rih — 3 and (right) Uy = 6,nh = 4. 


Uh 

logic {D2{P/Z Qmf)) 

logio {D2{P/Z\\Qy,=2)) 

4 

0.39 ±0.88 

-0.44 ±0.33 

8 

1.7± 1.1 

-0.06 ±0.33 

12 

3.1 ± 1.6 

0.21 ±0.36 

16 

4.6 ± 1.9 

0.34 ±0.41 


Table 2: a-divergences for synthetic RBMs with n„ = 6. Biases and weights are normal with zero 
mean and unit variance. The a = 2 data scales linearly; whereas mean-field scales exponentially. 


agree with this assumption, suggesting that the scaling with £ also agrees with Corollary 1 Also, 
KA — 10 proves to be sufficient for the majority of the data sets considered, while = 4 appears 
to be the threshold below which the quality of the gradient is negatively impacted by ka- 


We examine the value of D 2 obtained using Qmf and Qa =2 case in [Table 2 for a random en¬ 
semble of RBMs in order to assess the benefits of IRS training using Qa= 2 - We choose a large 
weight distribution for the data to emphasize the discrepancies between the two. For more mod¬ 
est weight distributions, the two quantities become comparable (see appendix). We find that using 
the mean-field approximation instead of the optimal distribution increases the expected k predicted 
by Corollary~T| by up to 7 orders of magnitude for a 22 unit RBM. In particular, the values of ka 
needed are sufficiently low such that the majority of these distributions can be exactly prepared from 
Qa =2 using a reasonable number of samples. 


3.4 Accuracy of training 

We will now compare the performance of IRS training algorithm to CD-1 training for small restricted 
Boltzmann machines. We use the following synthetic training data; 


[xi]j = 1 if j < n„/2, else 0 

[x 2 ]j = j mod 2, 


(13) 


and their bitwise complements. We further add Bernoulli noise of strength A7 = 0.1 to each of the 
bits in the training vectors to make the data set more challenging to learn. 


We assess the quality of the training methods by using Algorithm 1 [or CD-I to find an approximation 


to the optimal model parameters and then compute the exact gradients of 0ml) to find the location 
of the true optima that these methods estimate. The data is presented in Figure 2 for a small RBM 
with = 6 and Uh = 4. The data shows that while contrastive divergence finds an optima that is 
on average within 0.13% of that exact ML training can provide, IRS training deviates by 0.0015% 
and continues to converge to the ML optima as Wepochs increases. Further numerical evidence is 
provided in the appendix. Such differences are expected to be even more striking for deep restricted 
Boltzmann machines because IRS does not greedily optimize the weights 111. 
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Figure 2: Difference between the objective functions computed using IRS gradients (left), CD-I 
gradients (right) and the objective function evaluated at the optima found by ML training. Dashed 
lines denote a 95% confidence interval; solid lines denote the mean. For IRS, ka = 800, A^samp = 
A^train = 100, A/^ = 0.1 and Q is an equal mixture of the mean-held and uniform distributions, 
A = 0.05 and the learning rate varies from 0.1 to 0.001 at 10000 epochs. E(Oml) ~ —3.8724. 


4 Outlook 

Our results open a number of further avenues for further inquiry. Although IRS has asymptotic ad¬ 
vantages over existing methods for training DBMs in the presence of sufficiently strong regulariza¬ 
tion, our work only shows that it can provide accurate and efficient approximations to the gradients 
of the training objective under such circumstances. Further work will be needed to examine the 
performance of IRS training in practical machine learning problems. 

Since IRS achieves its goals by combining results from the disparate helds of variational approxi¬ 
mations and deep learning, it is natural to suspect that it could be optimized by going beyond the 
simple unimodal approximations used in the main body. Indeed, we see in the appendix that the 
use of bimodal approximations can substantially reduce the sample complexity of training. Further 
study may reveal even more practical variational approximations to P{x)/Z. 

Our work also illustrates that quantum machine learning may be an important avenue of inquiry 
for understanding machine learning in a broader context. This utility of thinking from a quantum 
perspective is not likely to be unique to training Boltzmann machines since classical computing is 
in a subset of quantum computing and hence every result shown for classical machine learning also 
applies to a subset of quantum machine learning. Conversely, lower bounds and no-go theorems 
proven for the quantum setting also apply to the classical setting, which makes quantum computing 
ideally suited for understanding the limitations and opportunities that physics places on a machine’s 
ability to learn. Just as quantum insights inspired the present work, re-examining other areas of ma¬ 
chine learning through the lens of quantum computing may not only lead to new classical algorithms 
but also provide deep insights into the nature of learning and inference. 
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A Review of Dirac Notation 

We will derive much of the theory behind our method using language from quantum computing. 
This language not only proves to be useful for representing rejection sampling based algorithms, but 
also is useful because it provides a clear method for dequantizing the method of m. The central 
object in quantum computing is the quantum state (which we will take to mean a pure state in the 
following). A quantum state is simply a unit vector in such that the magnitude squared of each 
of its components yields the probability of the corresponding outcome. 
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A quantum state is typically represented (using Dirac notation) as |'(/i) which can be interpreted to 
be a column vector in C^. Similarly, is its Hermitian transpose. The notation is linear, which 
means that if {|j)} is a complete orthonormal vector space on then the state |'(/)) can be written 
as 

N 

1^) = E I-?') ■ (14) 

i=i 

Physically, such states describe the probability distribution of outcomes that emerges when the state 
\ijj) is measured in this basis where each Pj gives the probability of finding the system in basis state 
\j). Classically, this measurement process is equivalent to sampling from a probability distribution 
but in quantum mechanics this is more subtle because the resultant probability distribution changes 
depending on the basis that the system is measured in; whereas in classical applications there is 
implicitly only one such basis. The fact that Dirac notation is basis independent gives it a distinct 
advantage for concisely describing distributions relative to column-vector notation. 

Dirac notation also has an implicit tensor product structure built in, meaning that 

I'tP) \(j)) ■= \ij) ^ \(j)). (15) 


This convention is useful for describing large sets of uncorrelated variables because the tensor prod¬ 
uct structure implicitly captures the lack of correlations between the variables descri bed by IV?) and 


those described by |0). Correlated distributions can be described by combining (14) and (15) 


via 


y]] Oij |m*) luj), (16) 


where \ui) and |uj) represent orthonormal basis vectors that span the space that the variables de¬ 
scribed by \ij}) and \(j)) are supported in. Also, for the probability of drawing a particular sample 
Vj from the marginal distribution over vj is, for example, Pr(uj) = the resultant 

marginal state over the Ui is 


E 




\u^) . 


(17) 


B Proofs 


B.l De-quantization of Quantum Rejection Sampling and Proof of Theorem 1 


While the states in quantum computing are the fundamental objects of the computational model, 
quantum computers also need a complete set of operations that are capable of performing an ar¬ 
bitrary (unitary) transformation on input states. This means that a quantum computer has to have 
the ability to transform an arbitrary input unit vector into any other such vector. For the present 
purposes, however, it suffices to note that probability distributions are first class entities in quantum 
computing and that quantum computers provide a method for preparing any such distribution. As a 
consequence, it should come as no surprise that notation developed to describing quantum devices 
should also be useful for describing classical sampling algorithms. 

Quantum rejection sampling is one of the most important tools used to design quantum algorithms 
not only because it can be used to enable exponential speedups for certain tasks, but also because it is 
very natural in quantum computing El. Let us assume that we have a (potentially un-normalized) 
distribution P that we cannot directly prepare and also have an efficiently preparable distribution Q. 
Furthermore, let us also assume that a constant is known such that P/Q < k. We then prepare the 
state 

If we measure the right-most register, which we will refer to as the coin, to be one then the resultant 
state over the sample register is 


E 


I P{x) 

Y.xP{^) 


^)-E 


p{x) 

z 


c), 


(19) 
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an d thu s if we consider the state post-selected on successfully measuring the right most register 
in (18) to be 0 then we prepare the desired distribution P{x) over the remaining register. The 
probability of success for this is 


Although quantum notation is used in this procedure, there is nothing inherently quantum about it 
as written. In fact, the exponential advantages that quantum algorithms accrued through quantum 
rejection sampling arise only because the distributions P or Q cannot be efficiently computed or 
because the distribution Q cannot be efficiently sampled from. This is summarized in the following 
lemma. 

Lemma 2. Let Q{x) : x G he an efficiently computable probability distribution that can 
be efficiently sampled from. Assume that P{x) : x G can be efficiently computed and 

P{x)/Q(x) < K y X G then the task of drawing samples from the distribution yielded by 
quantum rejection sampling can be efficiently simulated by a classical computer. 


Proof. Let us assume that we are provided with a state as per (18) Drawing a sample from the cor¬ 
rect distribution corresponds to measuring the coin register in |(18)| and conditioned on measuring 1 
the sample register is measured and the result is output as the sample from P{x). Because the partial 
trace is a commutative operation, the order of these two measurements is arbitrary. Instead, we could 
first measure the sample register resulting and if the result is x then the marginal distribution over 

1^) \J^ ~ coin register can be measured and the sample 


the coin is 


P{^) 

Q{x)k 


X will be accepted if and only if the result is 1, which occurs with probability P{x) / {Q{x)k). This 
process is equivalent to the original quantum algorithm. 


By assumption the distribution Q{x) can be sampled from efficiently be a classical computer. This 
means that the first step in the re-ordered quantum algorithm can be efficiently simulated. Next, 
using the fact that P{x)/{Q{x)k) < 1 and that P{x) and Q{x) are efficiently computable it fol¬ 
lows that we can efficiently simulate drawing a sample from the coin register by sampling from a 
Bernoulli distribution with p = This process, also known as rejection sampling, is efficient 

and hence quantum rejection sampling can be efficiently simulated using a classical computer under 
these assumptions. □ 


This consequently shows that the GEQS algorithm for training deep networks given in IH can be 
efficiently simulated and has a direct classical analog in the IRS algorithm. Similarly, the GEQAE 
algorithm in m can also be efficiently simulated but IRS is not a natural analog of it since GEQAE 
uses a manifestly quantum method (known as amplitude estimation) for inferring the expectation 
values needed to train the DBM. This method may be of particular importance because it can lead 
to quadratic reductions in the number of times the database of training vectors needs to be queries, 
which leads to significant cost savings for typical machine learning problems. 


Despite the fact that Lemma 2 shows that a classical computer can often simulate quantum rejec¬ 
tion sampling efficiently quantum computers can nonetheless provide huge advantages for rejection 
sampling. In fact, speedups relative to classical algorithms can arise from the use of quantum sub¬ 
routines such as amplitude amplification to quadratically reduce the rejection rate in the sampling 
process ca. However, since we focus on classical algorithms for rejection sampling we will ignore 
such quantum algorithms in the following. 


Proof of Theorem 1. The proof here follows one given in |[T|. The algorithm described in Lemma 2 


will fail as writted if P/Q > ka because the marginal distribution over the coin register will no 
longer be normalized. This can be addressed, at the price of introducing errors in the distribution, 
by clipping the probabilities used in the coin register to [0,1] for the configurations x G bad where 
P/Q > Zqka- Using Dirac notation, we wish to sample from the following state. 


VQi^) k) (J' 

:Ggood \ V 


P{x) 


Q{x)Zqka 


| 1 ) + “ 


P{x) 


Y V^k)|l)> (20) 


tcGgood y \ y y y y xGbad 

where good = Z 2 " \ bad. The probability of measuring the coin to be 1 is 


P. 


accept 


= E 


Zq 

xGgood ^ xGbad 


Zqka 


( 21 ) 
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We denote the resultant state 


X 


^^xGgood ^jp{x) ix}+x: xGbad ^ ZqkaQ{x)\x) 
yJYlx P{x) + Y.x^h^2^Q^AQ{x) - P{x)) 


( 22 ) 


Now the fidelity of the marginal state that results from post-selected measurement of the coin regis¬ 
ter (i.e. accepting the sample from Q) is 


{P\P) 


^ Sxsgood + Exgbad ZqKaQ{x) 


This shows that there exists a quantum algorithm that has the desired success probabilities and incurs 
error at most e in the resultant distribution. The remainder of the proof then follows by the same logic 
as that used in [Lemma 2| to show that the algorithm can be de-quantized by exchanging the order 
that the coin and sample registers are measured in. Thus there is an equivalent classical algorithm 
under the assumptions that P and Q are efficiently computable and Q can be efficiently sampled 
from. □ 


B.2 Proof of Corollary 1 


Proof of Corollary 1. Let us focus on the problem of approximating the model expectation present 
in the gradient with respect to Wij. The triangle inequality and Theorem 1 imply that if P is the 
probability distribution that is obtained after rejection sampling is performed using ka and {yk : 
k = 1,..., A^samp} are the samples drawn from P{x) in the rejection sampling protocol 


N, 


^ ^ ^x,y(^f.^XiXj ) model 




< 


< 


< 




N, 


samp 


'y ^ ^x^y^k'^X^Xj y ' xj^Xji^Pjxf -f y ^ X^X^PiP) (al2Xj)model 


k=l 


y ' xiXj{P{x) ^xe{vk}/^saTnp) + ^ ' xiXj{P{x) PIZ) 

- 4g{j^4/iVsamp) + '^iP{x) - P/Z) 


e O 


y ' xiXj{P{x) ^a:G{j/fc}/A^samp) 


ZD^jP/ZWQ) 

kaZq 


(24) 


Because the units are binary the variance of XiXj is at most 1, which means that the sampling error 
can be bounded and hence [(24)] is 

1 , ZD2{P/Z\\Q)\ 


o 




samp 


KaZq 


(25) 

□ 


The exact same argument can be applied to the model average and the gradients of the biases. The 
conclusion is identical in each case, that the contribution to the error from sampling and using an 


insufficient value of ka is at most (25) For a connected graph, the maximum number of components 
of any of these vectors is £. The triangle inequality and the fact that || • |j 2 < V^|| • Umax then gives 
us our result. 


C Additional Numerics 

C.l Scaling with A 

In the main text we examined the scaling of D 2 for random RBMs with edges whose weights are 
drawn from Af{0, 1) for different sized graphical models. Such Boltzmann machines are not ex¬ 
pected to be typical of those that emerge from training in the presence of regularization, which is 
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Figure 3: The total probability mass that is well approximated, meaning P{v, h) < kaQ{v, h) as a 
function of ka for an RBM with = 6 and Uh = 8. The left plot corresponds to A = 0.1 and the 
right plot A = 0.01. 


expected to lead to substantial weight decay for large models. We examine the accuracy of rejection 
sampling, as a function of ka, for both mean-field approximations and Qa =2 in Figure 3 Specifi¬ 
cally, consider an RBM with riy = 6 and Uh = 8 whose weights were found by exact training and 
training vectors drawn from the synthetic set described in the main text with Af = 0.1. We also take 
Zmf = Z in order to simplify the comparison. 


The data in Figure 3 shows that for A = 0.1, the mean values of the probability such that P(u, h) > 
kQ{v, h) are graphically indistinguishable for both the mean-field approximation and Qa= 2 - This 
is to be expected since A = 0.1 corresponds to relatively strong regularization and it is perhaps 
not surprising that if the Gibbs distribution is nearly a product distribution that optimizing either 
KL{Q\\P) or D 2 {P\\Q) should lead to comparable results. The sum of the bad probabilities was 
found to fall off, for large ka, as for Qa =2 and A = 0.01. In contrast the asymptotic scaling 
for Qmf for A = 0.01 is unclear for this regularization constant as the values of A considered are 
insufficient to see the asymptotic behavior of the curve. 


There are substantial differences between the fraction of conhgurations that are correctly handled 
for large ka for A = 0.01. As expected by Corollary 1, Qa =2 outperforms Qmf in the asymptotic 
limit. It perhaps is unsurprising that Qmf can outperform Qa =2 for modestly small ka because 
choosing the distribution to minimize the average value of P{v, h)/Q{v, h) may not minimize the 
tail probabilities as effectively as one may like owing to the looseness of the Markov inequality. 
In practice, this suggests that minimizing for even more general a divergences may be useful for 
trading off asymptotic versus short-time performance of the sampling algorithm. 


We examine the same problem but for fixed A and variable nt in Figure 4 We note that for these 


networks that the value of ka needed to obtain 99% of the probability mass scales roughly as 0.27n/i 
for A = 0.1. This scaling should be taken with a grain of salt as we do not have enough data to 
meaningfully extrapolate to large system sizes. However, the salient feature is that there is no sign 
of exponential divergence despite the number of edges in the graphical model tripling. 


For A = 0.01 we notice no such trend: the sum of the good probabilities alnn = 8 is on average 
less than that observed for rih = 12. This is likely because the model contains 48 edges for rih = 8 
and if A = 0.01 the effect of the regularization term on Oml is potentially less signihcant than it 
would be if the same distribution of weights were taken at rih — 12 where 72 edges are present. 
As such it is not helpful to consider how the cost of refining the mean-held approximation into the 
Gibbs state scales with the size of the graphical model. More important features such as the sparsity 
of the graph and the presence or absence of frustration are expected to better determine the viability 
of these variational approximations 


Another interesting feature is that at k = 1 the mean values of the sum of the good probabilities in 
the distribution that results from rejection sampling for nt = 8 and nu = 12 are within (1.5 ± 0.2)% 
of each other if A = 0.1. This is despite the fact that the Hilbert space for the rih = 12 case is 16 
times larger and the number of edges in the model is 50% larger. This shows that in the presence 
of regularization the error in variational approximations to the Gibbs distribution need not strongly 
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Figure 4; The mean value and 95% confidence interval for the total sum of the probability for the 
configurations where P{v, h) > KQa= 2 {v, h) for A = 0.1 (top) and A = 0.01 (bottom) for trained 
RBMs with rift, = 4 (left) rift = 8 (center) rift = 12 (right). 


depend on the size of the graph. Similar results have been noticed for mean-field state preparations 
in IT]. The results for A = 0.01 is much more significant with an average difference of (11.6±0.2)%. 
We suspect these differences occur because the graphs considered are not yet large enough for weak 
regularization to push the Gibbs state towards an approximately unimodal distribution. 


C.2 Scaling with ka 

The quantity ka is perhaps the most important factor for determining the viability of our method 
relative to contrastive divergence because it dictates the success probability of the rejection step. 
Here we will provide numerical evidence in small samples that illustrates that small values of ka 
provide comparable, or greater, accuracy than contrastive divergence training. For all these results 
we again use an equal mixture of Qmf the uniform distribution as our instrumental distribution. 
We anticipate that the use of Qa =2 will tend to result in better results for a fixed value of ka- 

Figure 5 shows the difference in the value of the objective functions obtained relative to those found 
by exactly following the ML-objective function for an RBM with Uy = 6 and lift = 4 and A = 0.05. 
We observe for that data that ka = 75 suffices to provide a mean discrepancy that is comparable 
to contrastive divergence. It is important to note though that a substantial fraction of the results 
for even ka = 50 are dramatically better than the results for contrastive divergence; however, the 
worst cases are nearly twice as bad. The results for ka < 100 show evidence of saturating and by 
computing the values that they saturate at we see that the data agrees well with an scaling 

and has relatively poor agreement with powerlaw or linear scalings. This captures the observed fact 
that the quality of the gradients rapidly improves as ka is increased. In particular, for ka = 200, 
the discrepancies between the worst case scalings and the best case scalings of the data collapse and 
nearly all of the 1000 examples tested were found to perform better than contrastive divergence. 

These results do reveal an interesting feature of contrastive divergence, namely its consistency. The 
best and worst case performances seem to be tightly clustered relative to those observed for our 
sampling algorithm. It also seems to perform better for the first few epochs of optimization than 
IRS does, even in the limit of large ka- This illustrates that contrastive divergence is likely to re¬ 
main an algorithm of choice for serial (rather than parallel) training environments where variational 
approximations to the Gibbs states are expected to abjectly fail. 

We examine a similar case with much weaker regularization in |Figure'6| There we note that much 
larger values of ka are needed to obtain good approximations to the gradient. In particular, we 
observe that ka = 400 is the first example where the best case performance beats that of contrastive 
divergence. The data set is otherwise qualitatively similar to that considered in [Figure 5| except 
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Figure 5: Mean and 95% confidence interval for the discrepancy between the values of the ML 
training objective found using IRS training and CD-I training for RBMs with = 6 and Uh = 4, 
A = 0.05 and = 50 (Top left), ka = 75 (Top right), ka = 100 (Middle left), ka = 200 (Middle 
right) and contrastive divergence training (Bottom). 


no evidence of training plateauing is observed within the number of epochs considered (with the 
exception of some of the data for ha = 200). 


D Hedging strategies 


In the main text, we showed that choosing our p roduct dis tribution Q to minimize the D 2 {P\\Q) 
rather than KL((3||P). However, as we noted in Figure 3 the mean-field approximation may ac¬ 
tually yield a better approximation to the Gibbs state than Qa =2 does if ka is small. The reason 
for this is that D 2 attempts to minimize the average ratio between the two, but in practice it may 
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Figure 6: Mean and 95% confidence interval for the discrepancy between the values of the ML 
training objective found using IRS training and CD-I training for RBMs with = 6 and Uh = 4, 
A = 0.01 and ka — 200 (Top left), ka = 400 (Top right), ka = 800 (Middle left), ka = 1600 
(Middle right) and contrastive divergence training (Bottom). 


not necessarily model the high probability regions of the distribution accurately. In contrast, the 
mean-field distribution aims to find the distribution that is closest to the Gibbs distribution but re¬ 
quires a large value of ka to accurately model the tails of the probability distribution. This begs the 
question of whether choosing a different instrumental distribution that combines the best features of 
both distributions can be used. 

A strategy proposed in Cl is to choose the instrumental distribution to be a combination of the mean- 
field distribution and one that captures the tail probability more effectively. In Cl they mix uniform 
distribution with the mean-field distribution. While this works well for small systems, it is not 
expected to work well for high-dimensional systems because the prediction shrinks exponentially 


15 








Figure 7; The sum of the probability mass of the configurations that is correctly handled by the 
rejection for hedging strategies using different values of 7 for an RBM with n„ = 6 and Uh = 8 
with A = 0.01. The weights and biases of the RBM were those at the local optima of Oml- 


with the dimension of the Hilbert space that the probability distribution is supported over. Instead, a 
more sensible approach is to combine the mean-field and Qa =2 in the following way 

Q{v, h) = 7(5mf(w, ft.) + (1 - 7)(3 q=2(u, ft), (26) 

for 7 e [0,1]. 


We examine this strategy in Figure 7 where we observe that for small « 1 the quality of the 


approximation yielded at 7 = 1/2 is comparable to that at 7 = 1. This is surprising because 
averaging the mean-field approximation with one that is known to give a worse approximation may 
be expected to result in an inferior approximation. For larger values of ka, the 7=1/2 hedged 
approximation outperforms either Qmf or Qa =2 individually. In fact, at ~ 211 the distribution 
with 7 = 1/2 outperforms both of the distributions by roughly 45%. This not only suggests that 
hedging can lead to substantial improvements but also suggests that choosing different values of a 
to interpolate between the performance of a = 2 and a = 0 (mean-field) may also improve the 
performance of IRS for small kalues of ka- 


E Detailed algorithm for computing gradients 

We provide a more detailed algorithm below for computing the gradient of the ML objective function 
using IRS. The algorithm utilizes subroutines Q and Q that provide an estimate of the probability 
of a given configuration of hidden and visible units. These functions will often correspond to a 
tractable approximation to the Gibbs distribution such as the mean-field approximation or a product 
distribution that minimizes the a = 2 divergence with the Gibbs state. We also assume that a 
sampling procedure is known for these distributions. 
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Algorithm 2 Algorithm for estimating V Oml ■ 


Input: Initial model weights w, visible biases b, hidden biases d, ka,^ set of training vectors Strain, 
a regularization term A, a learning rate r and the functions Q{v, h), Q{h; v), Zq, ZQ^h.^y 
Output: gradMLw, gradMLb, gradMLd. 

for i = 1 : Atrain do 
success t— 0 

while success = Odo > Draw samples from approximate model distribution. 

Draw sample (w, h) from Q{v, h). 

Es — E{y^ E) 

Set success to 1 with probability min(l, j{Z qKaQ{v, h))). 
end while 

modelv[i] t— v. 
modelH[i] t— h. 
success t— 0 

U t Xtrain [i] • 

while success=Odo > Draw samples from approximate data distribution. 

Draw sample h from Q{h; v). 

Es t— E{v, h). 

Set success to 1 with probability min(l, h))). 

end while 
dataV[f] t— V. 
dataH[i] <— h. 

end for 
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